WIP: Add model merge example #5741
For this PR, I think that in addition to merging two models, it should also add a feature to evaluate a single layer multiple times. |
@sorasoras Yeah, I think I'll try that next. For the moment, I haven't yet been able to test this PR. Also, I planned to start by simply processing layer by layer; that way I don't modify any offsets (and thus make no changes to the metadata). The feature you mentioned requires changing metadata, which I haven't had time to look into yet. But it's definitely something I'll try in the future. |
That's fair, but I was thinking changing metadata is easier to implement and test on existing models. |
I would be interested in layer interleaving. Is this only for merging layers' weights linearly, or can it do passthrough? Also this line is not entirely clear: Most frankenmerges for passthrough are done like so:
Can this kind of block repetition be done with this code? |
@dnhkng Yeah, in fact I have a typo in This PR only aims to merge the weights linearly, meaning it does not add or remove any layers in the merged model. One thing I don't understand in the lazy merge kit format though; can you please clarify it? Does interleaving mean that some layers are repeated (for example, [0-20] + [10-30] results in [0-10] + [10-20] + [10-20] + [10-30])? Thank you in advance. |
It's true that the logic for my CONFIG argument is not correct. In fact, it should always be used with the "scale". For example, if I want to take 0-7 from model A and 8-12 from model B: CONFIG1 = But I'm planning to re-design the whole thing though, to prepare support for the "repeated layers" option |
This would result in: This is why Frankenmerge models are larger than base models. Personally, I would be interested in a hybrid approach, with the ability to both merge and layer! Trying to stay with your parameter notation, the closest I could get for the 2 configs would be: As both configs must be the same length, for model_a we used |
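To make the size increase concrete, here is a minimal sketch of passthrough stacking. The `expand_slices` helper and the half-open `(model, start, end)` tuples are illustrative assumptions; they are not mergekit's format or the format used in this PR.

```python
def expand_slices(slices):
    """Concatenate (model, start, end) slices into a flat list of (model, layer) pairs.

    Assumption: ranges are half-open, like Python's range().
    """
    layers = []
    for model, start, end in slices:
        layers.extend((model, i) for i in range(start, end))
    return layers


# Hypothetical self-merge: two overlapping slices of the same 30-layer model.
stack = expand_slices([("a", 0, 20), ("a", 10, 30)])

# Layers that appear more than once in the output stack.
repeated = sorted({layer for layer in stack if stack.count(layer) > 1})

print(len(stack), "output layers,", len(repeated), "source layers used twice")
```

Under these assumptions the overlap between the two slices is emitted twice, which is where the extra layers of a frankenmerge would come from.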
Thanks for the explanation.
According to discussion #4718, the GGUF format may benefit from pointing two weight entries in the metadata to the same tensor; this way two or more layers can use the same weights. I haven't tried this yet, but it's surely essential if we want to have repeated layers.
Having both merge + repeated layers is great. But for that, I think the whole notation that I invented
The file above results in output model having:
It's not as robust as the lazy merge kit syntax (YAML), but it gives us more room to improve in the future. Additionally, someone can easily write a Python script to convert lazy merge kit YAML to my syntax. What do you think about this approach? |
Sure, I think we should do it. I was about to start testing Mergekit now, but I can quickly switch gears and write a Python converter script.
Yes, that would be a better method. I have a large model I know quite well that I've merged manually in ExllamaV2. It took a bit to sort out KV caching though, and there are issues when the model spans multiple GPUs. At first, I would just duplicate. If you can generate the merging code, I can compare the results of your method to the measured result of my merge. Update: I could write the Python converter, but now that I look in more detail, I think the layer-by-layer method here is much more powerful. Mergekit only allows either slice interleaving OR linear/spherical interpolation of all layers. The config model you describe is more verbose, but much more powerful. I would prefer that, TBH. There are two options: 1) easy parsing with just 3 values:
Or YAML, and give all the details:
|
Thanks for the input, I'll need to rework this PR in the next few days. Regarding the format, I still think having the ability to specify the weights of a and b separately can be interesting. I don't know what will happen if we take The CSV format should simplify the C++ parser code though, so I'll consider that. The YAML format is readable, but unfortunately we can never include a YAML parser in llama.cpp. However, having it as the input of your Python script (with the script converting that YAML into CSV or something llama.cpp can understand) will be very useful. |
Yes, the YAML could be converted to CSV easily, if we leave out various interpolation types. For completeness, I would explicitly put in all weights, and normalise to reach a sum of 1.0
and for three models:
The last layer here gets normalised to 1/3, 1/3, 1/3. |
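The normalisation step can be sketched in a few lines. The `normalise` helper and the example weights are assumptions for illustration; the thread's own example rows are not reproduced here.

```python
def normalise(weights):
    """Scale per-model weights for one layer so that they sum to 1.0."""
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one model must have a non-zero weight")
    return [w / total for w in weights]


# Hypothetical rows: two models, then three models with equal raw weights.
two_models = normalise([3.0, 1.0])
three_models = normalise([1.0, 1.0, 1.0])
print(two_models, three_models)
```

Equal raw weights across three models come out as one third each, matching the 1/3, 1/3, 1/3 case described above.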
@dnhkng I updated my PR to have the ability to:
To simplify my CSV parsing code, I chose the column order "model - scale - model - scale" (instead of "model - model - scale - scale").
If you add a third model, the columns become "model - scale - model - scale - model - scale". I tried it myself and confirmed that the output model can be loaded and runs inference without any problem. What I could not verify is whether the merging result (semantic result) is good or not (in other words, did it do |
FYI, I was also thinking about adding the ability to merge quantized models, but at this stage it's quite tricky: I must dequantize it, do calculations with |
Could you add a branch for pass-through (no linear interpolation) of quantized models? I have a use case for that right now, i.e. a single quantized model with repeated layers. The issue is that, from my tests, model self-merging only starts to help from 34B models and up. At FP16, that's a huge amount of RAM! I have a model that is a positive outlier on a difficult LLM benchmark, so it should be relatively clear whether the merge worked. It's a 70B model, so I'll need to run the tests on an 80 GB GPU. Interpolating layers would be an added benefit in the future though! I will pull your code and try it on FP16 Llama7B now, but I know all outputs will be worse than the base model. However, I know the regions of "really bad" and "slightly bad", so I can see if it is at least making sense. |
I'll try quantized models later. At least, loading a q4_K model and then outputting it as f16 is not too complicated. Only the requantization part is too tricky for me. Also, just out of curiosity: if you merge the model then use One thing I'll try to work on is the ability to re-use the same tensor for repeated layers. For now, if the output model has duplicated layers, the associated tensor data will be duplicated (not ideal). |
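A minimal sketch of parsing one row in the "model - scale" column order, written in Python rather than the PR's C++. It assumes each "model" column holds the source layer index in that model and each "scale" column holds a float; the actual column semantics in the PR may differ.

```python
def parse_row(line):
    """Parse 'layer,scale,layer,scale,...' into a list of (layer, scale) pairs.

    Assumption: one pair per input model, in the order the models were given.
    """
    fields = [f.strip() for f in line.split(",")]
    if len(fields) < 2 or len(fields) % 2 != 0:
        raise ValueError("expected an even number of columns: " + repr(line))
    return [(int(fields[i]), float(fields[i + 1])) for i in range(0, len(fields), 2)]


# Hypothetical rows for two and three input models.
print(parse_row("0,1.0,0,0.0"))
print(parse_row("5,0.5, 5,0.25, 7,0.25"))
```

Keeping each model's columns adjacent means the parser can walk the row in fixed-size steps regardless of how many models are merged.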
Reusing layers makes sense, but the caching is tricky. There's a discussion on my pull request for ExllamaV2 here: turboderp/exllamav2#275 |
I can try Q4 -> FP16 and re-quantization. I'll keep watching this pull request, and test it when it's ready. Intermediate disk space is fine, I have a few TB of SSD free ;) |
Personally, I don't think a shared cache among layers is technically possible. While the weights are the same, the KV is calculated from the embeddings produced by the layers before it (correct me if I'm wrong). For example, when you have 2 consecutive layers having the same weights P.S.: I was actually bad at math in high school / university. Nowadays, with all this machine learning stuff, I still imagine a "tensor" as a "Rubik's cube" in my head. |
Yes, you can't share the cache; it would get overwritten when the higher layer is processed... But it still works! The results are worse, but that's not unexpected. The fact that it even slightly works is crazy though. I have done quite a lot of testing on various permutations of layers, and most are worse, but there are a few interesting combinations. GGUF would be the best way to share them, as going via FP16 torch tensors, then merging, then converting to GGUF and finally quantizing seems like a lot of wasted effort! Better to experiment in ExllamaV2 dynamically, then build and distribute in GGUF. |
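A toy illustration of why two layers with identical weights still produce different keys: the key projection is applied to each layer's input, and the second layer's input is the first layer's output. The 2x2 matrices and the simplified residual `layer` function are made up for this sketch and are not a real transformer block.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


# Hypothetical shared weights, reused by both layers.
W_K = [[1.0, 0.5], [0.0, 2.0]]    # key projection
W_MIX = [[0.1, 0.2], [0.3, 0.4]]  # stand-in for the rest of the block


def layer(x):
    """Return (key computed from this layer's input, this layer's output)."""
    key = matvec(W_K, x)
    update = matvec(W_MIX, x)
    return key, [xi + ui for xi, ui in zip(x, update)]


x0 = [1.0, 2.0]
k_first, x1 = layer(x0)   # first use of the weights
k_second, x2 = layer(x1)  # same weights, different input
print(k_first, k_second)
```

The keys differ even though `W_K` is shared, so a repeated layer needs its own cache slot if the cached values are meant to be exact.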
Tested it with a self-merge today on F16, and it looks good! I will fire up an evaluation pipeline over the weekend, and do more extensive testing. Just to clarify: Also, Mergekit offers Spherical linear interpolation (SLERP). This seems to offer better merges. (brief description here). |
Thanks! Glad to know that it works in your test. Only linear merging is supported for now. SLERP is interesting too and technically possible (because internally we dequantize all matrices to What's not clear to me though: SLERP works with vectors, but we have matrices as model weights. How can SLERP apply to a matrix? For example, will a 4x4 matrix be considered as one vector of 16 dimensions, or as 4 vectors of 4 dimensions each? |
In PyTorch it seems straightforward. The implementation is here, from line 94: I have just bought some cloud compute to test the merged model; I need 80Gb VRAM for it to run at a useful speed. It will take a few hours at least. |
Oh OK, thanks for the info. It seems that in the Python code, there is no place where the tensor view is changed to 1D. That means it keeps one row of the matrix == one vector. I can wait, don't worry. I'm trying to refactor the re-quantization part in another PR, so we should get some more performance when having a quantized model as output. |
Great! I'm merging a 70B model, and it's not super fast. Many layers have a 1.0/0.0 weight ratio. Maybe as a backlog item: if a new layer takes 100% of its weight from one model, skip dequantization, merging and re-quantization, and just pass that layer through. Not urgent though. It looks like the merge will take about 30 minutes. |
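The two readings of "SLERP on a matrix" give different results, which the following sketch shows in plain Python. The `slerp`, `slerp_rows` and `slerp_flat` helpers are illustrative assumptions written for this comparison; they are not taken from mergekit or from this PR, and the thread does not settle which reading mergekit uses.

```python
import math


def slerp(t, v0, v1, eps=1e-8):
    """Spherical interpolation between two vectors; falls back to lerp when nearly colinear."""
    n0 = math.sqrt(sum(x * x for x in v0))
    n1 = math.sqrt(sum(x * x for x in v1))
    lerp = [(1.0 - t) * a + t * b for a, b in zip(v0, v1)]
    if n0 < eps or n1 < eps:
        return lerp
    dot = sum(a * b for a, b in zip(v0, v1)) / (n0 * n1)
    theta = math.acos(max(-1.0, min(1.0, dot)))
    sin_theta = math.sin(theta)
    if abs(sin_theta) < eps:
        return lerp
    w0 = math.sin((1.0 - t) * theta) / sin_theta
    w1 = math.sin(t * theta) / sin_theta
    return [w0 * a + w1 * b for a, b in zip(v0, v1)]


def slerp_rows(t, m0, m1):
    """Reading 1: every row is its own vector."""
    return [slerp(t, r0, r1) for r0, r1 in zip(m0, m1)]


def slerp_flat(t, m0, m1):
    """Reading 2: the whole matrix is one flattened vector."""
    cols = len(m0[0])
    flat = slerp(t, [x for r in m0 for x in r], [x for r in m1 for x in r])
    return [flat[i:i + cols] for i in range(0, len(flat), cols)]


m_a = [[1.0, 0.0], [0.0, 1.0]]
m_b = [[0.0, 1.0], [0.0, 1.0]]
by_rows = slerp_rows(0.5, m_a, m_b)
flattened = slerp_flat(0.5, m_a, m_b)
print(by_rows)
print(flattened)
```

Because the angle is measured over different vectors in each case, the choice affects the merged weights and is worth confirming against the reference implementation before porting.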
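The suggested shortcut amounts to a small check before the expensive path. This is a sketch only: `plan_layer` and its return values are hypothetical names, not part of the PR.

```python
def plan_layer(scales, eps=0.0):
    """Decide how to build one output layer from per-model scales.

    Returns ("copy", i) when model i carries all the weight, so its tensor
    could be passed through untouched; otherwise ("merge", None).
    """
    nonzero = [i for i, s in enumerate(scales) if abs(s) > eps]
    if len(nonzero) == 1 and abs(scales[nonzero[0]] - 1.0) <= eps:
        return ("copy", nonzero[0])
    return ("merge", None)


print(plan_layer([1.0, 0.0]))  # candidate for passthrough
print(plan_layer([0.5, 0.5]))  # needs dequantize, merge, requantize
```

A raw copy is only valid when the source tensor is already stored in the requested output type; otherwise the layer still has to be converted.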
FYI, I've just pushed a refactor commit that has better multi-thread usage for re-quant operation (using same code as |
I had a look at mergekit + SLERP today. I think I can add SLERP in this PR, as it makes more sense than the linear method. However, I will need to re-invent my input format. In the blog article, they specifically target some tensors, for example
The current CSV format does not allow specifying scaling at the tensor level. Therefore, I propose a new format inspired by assembly language:
I don't know if it's too complicated for your converter script @dnhkng ? |
OK, the 70B Model merge looks interesting. The merges go in the same direction I see with ExllamaV2, so I think everything is working OK! I have one small issue, that I'm trying to figure out still though. I use EQ-Bench to test the models, and weirdly, using llama.cpp server I get significantly worse results than using exllama via oobabooga. The relative changes are all correct, but the absolute scores for the llama.cpp backend are about 75.5, using the original leaked Miqu Q4/5 weights. However, I get a score of 82.7 for the Q4 weights with exllamaV2! A 7 points difference here is massive. This is extra weird, as the exllama weights are just the Miqu weights that have been de-quantized, converted and re-quantized, so you would not expect them to be so much better (I would expect them to be slightly worse). I've made an issue at the benchmark repo, but maybe someone here might know why this is the case.
All good, write a config style you like, and I'll write up a python converter :) So long as the format is sensible, it should be easy to generate a High-Level abstraction. The fallback is to write low-level by hand, for unusual cases. The combination is powerful. |
For the benchmark difference between llama.cpp server and exllama: apart from the chat template that I discussed in the other issue, it may also be because the KV cache of llama.cpp is f16 by default. (I don't know if exllama uses f16, bf16 or f32 for KV; note that even if the model is quantized, the KV may not be quantized.) I'll start working on SLERP and the new input format today, as the current implementation already outputs a usable result. |
@dnhkng I added the new format and SLERP. It's slightly different from what I proposed above, but will be easier to understand:
You can have a look at I've tried merging a dolphin-mistral with vistral (mistral but finetuned to understand Vietnamese). The output model does speak mixed English-Vietnamese, which indicates that my code kind of works. The merge config used is Feel free to ask if something is not clear to you. Thank you! |
I'll try and write a high level python configuration generator for the new format. |
If anyone is interested, then I think we should in theory be able to get a better estimate of the original fp16 values for the Miqu model by combining the q_5, q_4 and q_2 quantized values. I don't really know what criteria llama.cpp is using to quantize the values, but I assume it's to minimise the least squares error? If so then I think we can assume the values come from a normal distribution and then work out the correct weighting factor for the 3 different bin centres we have for every original fp16 value that was quantized. This obviously won't work for some distributional assumptions, e.g. if the original fp16 values came from a uniform distribution, then knowing which of the 4 bins the q_2 came from and which of the 16 bins the q_4 came from gives us no extra information over knowing which of the 32 bins the q_5 came from, and the maximum likelihood estimate is still just the centre of the q_5 bin (assuming the bin boundaries all align anyway). But I think the values are pretty likely to have come from an approximately normal distribution (especially due to all the layer norms in the model, etc.) and the correct weighting factors should be findable either analytically or empirically. Without explicitly working it out, I think the weights will likely be something like the #bins ratio squared (i.e. using the conjugate prior formula), but I'm pretty sure it could be worked out empirically quite easily if we know the exact criteria the quantization is using. It probably won't be a huge increase and at best be around the level of q_6, but it would likely be useful for those remerging the de-quantized fp16 model off huggingface. |
Yeah, to work it out analytically looks quite hard: but it wouldn't be hard to estimate the weights empirically as we could just simulate the forward quantization process used to create a q_5, q_4, and q_2 of a standard normal (using least squares criteria or whatever llama.cpp is using) and then find the optimal weighting factors to get the maximum likelihood estimate of the original fp16 value (or something similar anyway). It may turn out to be a different weighting factor for each of the 32×16×4 combinations, but even this wouldn't be hard to find empirically via simulation. |
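A toy version of that simulation, as a sketch only. It assumes a plain round-to-nearest quantizer with a per-block absmax scale (the hypothetical `quantize_block`), which is not the k-quant scheme llama.cpp uses, and it fits one global set of linear weights by least squares instead of a separate estimate per bin combination.

```python
import random


def quantize_block(block, bits):
    """Toy quantizer: round to nearest on a per-block absmax grid, then dequantize."""
    levels = 2 ** (bits - 1) - 1
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return list(block)
    scale = amax / levels
    return [round(x / scale) * scale for x in block]


def solve(a, b):
    """Solve a small linear system by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def mse(pred, truth):
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)


rng = random.Random(0)
truth = [rng.gauss(0.0, 1.0) for _ in range(32 * 256)]  # stand-in for fp16 weights
deq = {bits: [] for bits in (2, 4, 5)}
for start in range(0, len(truth), 32):
    block = truth[start:start + 32]
    for bits in deq:
        deq[bits].extend(quantize_block(block, bits))

# Least-squares fit of truth ~ w2*q2 + w4*q4 + w5*q5 via the normal equations.
cols = [deq[2], deq[4], deq[5]]
gram = [[sum(x * y for x, y in zip(ci, cj)) for cj in cols] for ci in cols]
rhs = [sum(x * t for x, t in zip(ci, truth)) for ci in cols]
weights = solve(gram, rhs)

combined = [sum(w * c[i] for w, c in zip(weights, cols)) for i in range(len(truth))]
mse_q5 = mse(deq[5], truth)
mse_combined = mse(combined, truth)
print(weights, mse_q5, mse_combined)
```

On the fitting data the combined estimate cannot be worse than q_5 alone, since using only q_5 is one of the candidate weightings; whether any gain carries over to real k-quants would need the actual quantization code.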
I agree with that: since we're using qX_K and not qX_0 or qX_1, the difference between 16 bins of q4 and 32 bins of q5 is not that much. Throwing q2 into the equation may make it worse. I assume that dequantizing q5 is already the best result we can get. |
Btw @dnhkng I came across mergekit's code for merging the embedding and output layers; it seems to be an important part of improving the quality of the output model too. I'll try to implement that this week, but it's quite tricky because sometimes we have models with different vocab sizes (i.e. added special tokens). |
Will that mean a new format for the configuration? |
No, don't worry, it will be just an additional (optional) command to add to the current format |
OK, I've written a YAML parser that converts high-level config files to your format, including some quite complex merges. |
@dnhkng Thank you! Seems good, I'll try it tomorrow |
Haven't seen any progress; any update? |
Yeah, sorry, I've been quite busy since then. The Python converter script looks good, but merging this PR (the part that I made) into master is quite risky, since it's quite large and I doubt anyone will find it helpful in the future. For now, I think we can consider this PR a demo. But feel free to let me know if you want to change something else. |
I don't know if it's a good idea or not.
Still WIP, not tested; it would be nice if someone could test it out.